Introduction
Seaborn is a Python plotting package built on matplotlib that automates much of the code required for statistical plots. In this notebook, we look at typical plots that you may want to use in your own work.
Packages used in this notebook
There are various plotting themes available in the seaborn package. A theme sets the overall look of a plot. More about the set_style function, which sets the plot style, can be found in the seaborn documentation. We set the argument of the set_style function to 'whitegrid' for this notebook.
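As a minimal sketch of the setup described above: the exact list of packages imported in the original notebook is not shown, so the imports here are an assumption. The call to set_style with 'whitegrid' is taken from the text.

```python
import matplotlib as mpl
import seaborn as sns

# Apply the 'whitegrid' theme to every plot created after this call
sns.set_style("whitegrid")

# The theme is stored in matplotlib's rcParams, so it can be inspected there
grid_enabled = mpl.rcParams["axes.grid"]
print(grid_enabled)
```

Because the style is global, it only needs to be set once near the top of a notebook.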
Data
We use the heart_failure.csv spreadsheet file again in this notebook and import it as a pandas dataframe object assigned to the variable df.
The column headers (variables) are listed using the columns attribute.
Index(['age', 'anemia', 'creatinine_phosphokinase', 'diabetes',
'ejectionfraction', 'hypertension', 'platelets', 'serum_creatinine',
'sodium', 'sex', 'smoking', 'time', 'death'],
dtype='object')
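The heart_failure.csv file is not reproduced here, so the sketch below builds a small synthetic stand-in with the same column names. The value ranges and the 'No'/'Yes' encoding of the categorical variables are assumptions made purely for illustration and are not the real data.

```python
import numpy as np
import pandas as pd

# In the notebook itself the data would come from the file:
# df = pd.read_csv("heart_failure.csv")

# Synthetic stand-in (assumed value ranges and encodings, NOT the real data)
rng = np.random.default_rng(seed=0)
n = 100
yes_no = ["No", "Yes"]
df = pd.DataFrame({
    "age": rng.integers(40, 96, size=n),
    "anemia": rng.choice(yes_no, size=n),
    "creatinine_phosphokinase": rng.integers(20, 8000, size=n),
    "diabetes": rng.choice(yes_no, size=n),
    "ejectionfraction": rng.integers(15, 80, size=n),
    "hypertension": rng.choice(yes_no, size=n),
    "platelets": rng.normal(260_000, 90_000, size=n).round(),
    "serum_creatinine": rng.uniform(0.5, 9.0, size=n).round(1),
    "sodium": rng.integers(113, 149, size=n),
    "sex": rng.choice(["Female", "Male"], size=n),
    "smoking": rng.choice(yes_no, size=n),
    "time": rng.integers(4, 286, size=n),
    "death": rng.choice(yes_no, size=n),
})

# The columns attribute lists the column headers (variables)
print(df.columns)
```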
Plots are selected based on the data type of the variable(s) that we want to visualize. The seaborn package divides its high-level plots into three categories (relational, distribution, and categorical plots), shown in the image below.
Relational plots
Relational plots visualize the correlation between continuous variables.
The seaborn package lists scatter plots and line plots as relational plots. In Figure 1, we use the scatterplot function to create a scatter plot of the \texttt{age} vs. the \texttt{platelets} variables to visualize the correlation between these two continuous variables.
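A call of this kind might look like the sketch below. The data frame is a synthetic stand-in with assumed values, since the real file is not available here. Only the column names come from the notebook.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the heart failure data (assumed values)
rng = np.random.default_rng(seed=1)
n = 100
df = pd.DataFrame({
    "age": rng.integers(40, 96, size=n),
    "platelets": rng.normal(260_000, 90_000, size=n).round(),
})

sns.set_style("whitegrid")

# scatterplot is an axes-level function and returns a matplotlib Axes
ax = sns.scatterplot(data=df, x="age", y="platelets")
```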
The correlation between continuous variables can be shown separately for each unique element (class) of a categorical variable by setting the hue argument to that variable. Each class is then colored differently. We can also add the style argument and set its value to a categorical variable. This adds more visual contrast by using a different marker style for each class. In Figure 2, we specify both arguments and set their values to the \texttt{anemia} variable.
Instead of using the classes of a categorical variable, we can visualize a third continuous variable using the hue argument. When the variable is continuous, its values determine the color of the markers. In Figure 3, we add the values of the \texttt{sodium} variable.
The size argument uses marker size instead of color to visualize a third continuous variable. In Figure 4, we use the size argument instead of the hue argument for \texttt{sodium}.
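The sketch below shows how the hue and style arguments could both be set to the same categorical variable. The data are synthetic, and the 'No'/'Yes' encoding of \texttt{anemia} is an assumption.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the heart failure data (assumed values)
rng = np.random.default_rng(seed=2)
n = 100
df = pd.DataFrame({
    "age": rng.integers(40, 96, size=n),
    "platelets": rng.normal(260_000, 90_000, size=n).round(),
    "anemia": rng.choice(["No", "Yes"], size=n),  # assumed encoding
})

# Color (hue) and marker shape (style) both encode the anemia class
ax = sns.scatterplot(
    data=df, x="age", y="platelets", hue="anemia", style="anemia"
)

# Collect the legend entries that seaborn generated for the classes
legend_labels = {text.get_text() for text in ax.get_legend().get_texts()}
```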
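To compare the two encodings of a third continuous variable, the sketch below draws both versions side by side on synthetic data (assumed values). The use of plt.subplots and the ax argument is a choice made for this illustration and is not taken from the notebook.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the heart failure data (assumed values)
rng = np.random.default_rng(seed=3)
n = 100
df = pd.DataFrame({
    "age": rng.integers(40, 96, size=n),
    "platelets": rng.normal(260_000, 90_000, size=n).round(),
    "sodium": rng.integers(113, 149, size=n),
})

fig, (ax_hue, ax_size) = plt.subplots(1, 2, figsize=(12, 4))

# Continuous hue: marker color follows the sodium value
sns.scatterplot(data=df, x="age", y="platelets", hue="sodium", ax=ax_hue)

# Continuous size: marker area follows the sodium value
sns.scatterplot(data=df, x="age", y="platelets", size="sodium", ax=ax_size)
```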
Distribution plots
Histograms can be used to visualize the frequency of data values. A histogram generates intervals (bins) and counts the occurrences of continuous values in each interval.
The displot function can produce a histogram. In Figure 5, we plot a default histogram of the \texttt{age} variable.
In Figure 6, we specify the bin intervals using the range function. The start value is 40, the end value is 110 (exclusive), and the step size is 10, giving bin edges at 40, 50, ..., 100. This produces a histogram along the age decades, which is easier to read.
In Figure 7, we show a histogram for each of the classes in the \texttt{anemia} variable using the hue argument.
Overlaid histograms can be difficult to read. If we are only interested in the combined frequency, but still want to visualize the proportions, we can set the multiple argument to 'stack' to produce a stacked histogram, shown in Figure 8 below.
It may be better to produce separate histograms for each of the classes. We achieve this using the col argument. In Figure 9, we have separate histograms for each class of the \texttt{anemia} variable.
The stat argument can be set to 'probability' to visualize the relative frequency instead of the frequency. In Figure 10, we see the relative frequency version of Figure 9 above.
Heat maps can be used to visualize the distribution of two continuous variables. We add a y argument to the displot function to visualize bivariate distributions. In Figure 11 we visualize the \texttt{age} and the \texttt{platelets} variables. The cbar argument adds a color bar.
A kernel density estimate adds smoothing to produce heat areas. We add the kind argument in Figure 12 and set it to 'kde'. We also add a rug plot (tick marks along the axes for each observation) using the rug argument with a value of True.
The jointplot function can combine different visualizations of the same data, by adding plots in the margins of the plot figure. The default in Figure 13 adds a histogram to a scatter plot given two continuous variables.
We can take more control over the joint plot and the marginal plots by assigning a JointGrid object to a variable and calling the plot_joint and plot_marginals methods on that object. In Figure 14, we add box-and-whisker plots to the margins.
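The sketch below illustrates this pattern on synthetic data (assumed values). The choice of scatterplot for the joint axes is an assumption, as the text only specifies box-and-whisker plots for the margins.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the heart failure data (assumed values)
rng = np.random.default_rng(seed=9)
n = 150
df = pd.DataFrame({
    "age": rng.integers(40, 96, size=n),
    "platelets": rng.normal(260_000, 90_000, size=n).round(),
})

# Set up the empty grid, then choose the plot for each region
g = sns.JointGrid(data=df, x="age", y="platelets")
g.plot_joint(sns.scatterplot)    # central plot (assumed choice)
g.plot_marginals(sns.boxplot)    # box-and-whisker plots in both margins
```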
Categorical plots
Although box-and-whisker plots visualize the distribution of a continuous variable, seaborn lists them as plots for categorical data. They are often used to compare the distribution of a continuous numerical variable between the unique elements of a categorical variable. The catplot function is used for a variety of plots for categorical data.
In Figure 15, we see a box-and-whisker plot of the \texttt{age} variable, for those with and without diabetes (unique elements in the \texttt{diabetes} variable). The kind argument is set to 'box', to indicate a box-and-whisker plot.
In Figure 16 we add a second categorical variable, \texttt{death}, using the hue argument.
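Box-and-whisker plots with and without a second grouping variable might look like the sketch below. The data are synthetic and the 'No'/'Yes' encoding of \texttt{diabetes} and \texttt{death} is an assumption.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the heart failure data (assumed values)
rng = np.random.default_rng(seed=10)
n = 200
df = pd.DataFrame({
    "age": rng.integers(40, 96, size=n),
    "diabetes": rng.choice(["No", "Yes"], size=n),  # assumed encoding
    "death": rng.choice(["No", "Yes"], size=n),     # assumed encoding
})

# Age distribution for those with and without diabetes
g_box = sns.catplot(data=df, x="diabetes", y="age", kind="box")

# The same comparison, split further by the death variable
g_box_hue = sns.catplot(
    data=df, x="diabetes", y="age", hue="death", kind="box"
)
```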
For larger data sets, setting the kind argument to 'boxen' gives a better indication of the distribution of the values of the continuous variable. Figure 17 visualizes the same data as Figure 16, with the kind argument set to 'boxen'.
Violin plots use a kernel density estimate to create the shape of the plots. This gives an even richer visualization of the distribution of the continuous variable. In Figure 18, we revisit the data of Figure 16, but as a violin plot, setting the kind argument to 'violin'.
To reduce the number of shapes, we can split the violin plots by the unique values of a binary variable such as \texttt{death}, using the split argument. This is shown in Figure 19 below.
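A split violin plot could be sketched as follows, again on synthetic data with an assumed 'No'/'Yes' encoding. Each half of a violin shows one class of the binary hue variable.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the heart failure data (assumed values)
rng = np.random.default_rng(seed=11)
n = 200
df = pd.DataFrame({
    "age": rng.integers(40, 96, size=n),
    "diabetes": rng.choice(["No", "Yes"], size=n),  # assumed encoding
    "death": rng.choice(["No", "Yes"], size=n),     # assumed encoding
})

# split=True draws the two death classes as the two halves of each violin
g = sns.catplot(
    data=df, x="diabetes", y="age", hue="death", kind="violin", split=True
)
```

Replacing 'violin' with 'boxen' and dropping the split argument gives the letter-value version of the same comparison.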
A number of other visualizations can be created using catplot. In Figure 20, we see a swarm plot, which is a type of categorical scatter plot. The kind argument is set to 'swarm'.
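A swarm plot might be produced as in the sketch below. The variables plotted in Figure 20 are not stated in the text, so the use of \texttt{diabetes} and \texttt{age} is an assumption, as are the synthetic values.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the heart failure data (assumed values)
rng = np.random.default_rng(seed=12)
n = 60
df = pd.DataFrame({
    "age": rng.integers(40, 96, size=n),
    "diabetes": rng.choice(["No", "Yes"], size=n),  # assumed encoding
})

# Each observation is drawn as a point, nudged sideways to avoid overlap
g = sns.catplot(data=df, x="diabetes", y="age", kind="swarm")
```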
Bar plots are quintessential plots for categorical data, showing the frequency of the unique elements of a categorical variable. In Figure 21, we see the frequency of those with and without diabetes. This is achieved by setting the kind argument to 'count'.
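A frequency bar plot of this kind could be written as in the sketch below (synthetic data, assumed 'No'/'Yes' encoding). Only the categorical variable is passed, and seaborn does the counting.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the diabetes column (assumed values)
rng = np.random.default_rng(seed=13)
df = pd.DataFrame({"diabetes": rng.choice(["No", "Yes"], size=200)})

# kind='count' draws one bar per class with the class frequency as height
g = sns.catplot(data=df, x="diabetes", kind="count")

# The bar heights add up to the number of observations
total_count = sum(bar.get_height() for bar in g.ax.patches)
```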
The last type of categorical plot that we consider in this notebook is the point plot. It visualizes the difference in the mean of a numerical variable between the unique elements of a categorical variable. A point plot also shows the 95\% confidence interval around the mean. We can also use the hue argument to show a separate point plot for each class of a second categorical variable. In Figure 22, we visualize the difference in the \texttt{age} variable between those with and without diabetes, for each of the survivors and non-survivors. The two categorical variables are \texttt{diabetes} and \texttt{death}.
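The point plot described above can be sketched as follows, using synthetic data with an assumed 'No'/'Yes' encoding for both categorical variables.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the heart failure data (assumed values)
rng = np.random.default_rng(seed=14)
n = 200
df = pd.DataFrame({
    "age": rng.integers(40, 96, size=n),
    "diabetes": rng.choice(["No", "Yes"], size=n),  # assumed encoding
    "death": rng.choice(["No", "Yes"], size=n),     # assumed encoding
})

# Mean age with a confidence interval per diabetes class,
# drawn as a separate line for each class of death
g = sns.catplot(data=df, x="diabetes", y="age", hue="death", kind="point")
```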
Setting plot and axes labels
A simple way to add a plot title and axis labels is to call the set method on the object returned by a plotting function. The set method accepts the arguments xlabel, ylabel, and title, each of which takes a string value. In Figure 23, we add a title and labels for both axes.
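As a sketch of this approach, the example below sets a title and both axis labels on a scatter plot of synthetic data. The label and title strings are made up for illustration, since the ones used in Figure 23 are not given.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the heart failure data (assumed values)
rng = np.random.default_rng(seed=15)
n = 100
df = pd.DataFrame({
    "age": rng.integers(40, 96, size=n),
    "platelets": rng.normal(260_000, 90_000, size=n).round(),
})

ax = sns.scatterplot(data=df, x="age", y="platelets")

# Hypothetical label text, chosen only for this example
ax.set(
    xlabel="Age (years)",
    ylabel="Platelet count",
    title="Platelet count by age",
)
```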